Abstract Smoothed analysis is a framework for analyzing the complexity of an algorithm, acting as a bridge 
between average and worst-case behaviour. For example, Quicksort and the Simplex algorithm are widely used in 
practical applications, despite their heavy worst-case complexity. Smoothed complexity aims to better characterize 
such algorithms. Existing theoretical bounds for the smoothed complexity of sorting algorithms are still quite 
weak. Furthermore, empirically computing the smoothed complexity via its original definition is computationally 
infeasible, even for modest input sizes. In this paper, we focus on accurately predicting the smoothed complexity 
of sorting algorithms, using machine learning techniques. We propose two regression models that take into account 
various properties of sorting algorithms and some of the known theoretical results in smoothed analysis to improve 
prediction quality. We show experimental results for predicting the smoothed complexity of Quicksort, Mergesort, 
and optimized Bubblesort for large input sizes, therefore filling the gap between known theoretical and empirical 
results. 

Keywords Smoothed Complexity • Sorting Algorithms • Machine Learning • Regression Models 


1 Introduction 

Smoothed Complexity (SC) was first introduced in (Spielman and Teng 2001), aiming to provide a more realistic 
view of the practical performance of algorithms compared to worst-case or average-case analysis. Motivated by 
the observation that, in practice, input parameters are often subject to a small degree of random noise, SC measures 
the expected performance of algorithms under slight random perturbations of the worst-case inputs (Spielman and 
Teng 2006). When the worst case is extremely rare in practice, a worst-case view can be problematic, especially for 
algorithms that have poor worst-case but good average-case complexity. Average-case analysis is an important 
complement to worst-case analysis, providing a more comprehensive view of the problem. Nevertheless, SC, a 
hybrid of worst-case and average-case, provides an alternative measurement for a given algorithm (Spielman and 
Teng). In practice, it is useful to understand how quickly the SC switches from worst to average case. By 
analyzing the worst-case inputs under perturbations, the SC intuitively indicates the probability of an algorithm 
encountering worst-case in practice. If the SC of an algorithm is low, then it is unlikely the algorithm will take a 
long time to solve practical instances, even if it has a poor worst-case complexity. 

Although very useful, SC is not easy to estimate, either theoretically or empirically. The SC bounds of 
many algorithms have been given, including the Simplex Algorithm, which has exponential worst-case complexity 
but polynomial SC (Spielman and Teng 2001; Deshpande and Spielman 2005), Quasi-Concave Minimization, an 
NP-hard problem with polynomial SC under certain conditions (Spielman and Teng 2009), and Quicksort, 
with worst-case complexity $O(N^2)$ but SC of $O(\frac{N}{p} \ln N)$, where $p \in [0,1]$ (Banderier et al. 2003). However, 
theoretical approaches to bound the SC generally require very complex proofs, and the resulting bounds are often 
weak. For sorting algorithms, (Schellekens et al. 2014) has shown that the gap between the exact (empirical) 
value of Quicksort's SC and its known bound is significant. 

Recently, modular smoothed analysis (Schellekens 2008) has been introduced to better estimate the value 
of SC for discrete cases. In (Schellekens 2008) it was shown that for an algorithm that satisfies random bag 
preservation, its SC value can be calculated through a recurrence equation. Although more accurate than the 
theoretical bounds, modular smoothed analysis currently works only for Quicksort and its median-of-three variant 
(M3Quicksort) (Schellekens et al. 2014; Hennessy and Schellekens 2014). Furthermore, because of its recurrence 
structure, the maximum input size for which it is feasible to compute the SC of Quicksort is 3000, and for 
M3Quicksort it is 130. 

An alternative to the above approaches is to try to compute the SC directly, using its original definition. For 
discrete cases, the SC under partial permutation perturbations is the maximum average runtime over all perturbed 
inputs (Spielman and Teng 2006; Schellekens et al. 2014). However, the perturbation step leads to a very heavy 
computing process. As shown in Figure 1, to calculate the SC for input lists of length N, under partial permutations 
with perturbation parameter K (the degree of perturbation), we need to generate a perturbed group for each 
input list, take the average runtime over each perturbed group, and take the maximum of these. The complexity 
for empirically computing the SC of Quicksort is then 

$$\frac{(N!)^2}{(N-K)!} \, N \log(N), \qquad (1)$$

quickly becoming infeasible even for very small inputs (i.e., size N of list to be sorted). 

Our Contribution. In this paper, we show how using machine learning techniques to predict the value of SC 
for sorting algorithms overcomes the difficulties raised by either a theoretical or an empirical approach. This is 
a new point of view, since apart from theoretical bounds and some empirical results for small input sizes, there 
is very little information on how the SC of sorting algorithms behaves exactly. We formulate the SC prediction 
as a regression problem, and present two successful predictive models. Model TLR-SC (Transformed Linear 
Regression for SC) turns the non-linear relationship of our selected features into a linear one, and delivers good 
prediction by simply using linear regression. Model NLR-SC (Non-linear Regression for SC) turns a surface 
fitting problem into multiple curve fitting problems, and by predicting the smoothed complexity curve by curve, 
gradually predicts the entire surface. Because NLR-SC takes advantage of the theory of smoothed analysis, it 
delivers good results with very few training examples. The initial learning models are built for Quicksort, but 
easily adapted to other sorting algorithms, e.g., M3Quicksort, optimized Bubblesort and Mergesort. Previously, 
there were no known results on the SC of the latter three sorting algorithms, for large input sizes. We believe the 
results in this work are useful for characterizing the behavior of sorting algorithms in practice, and general enough 
so that they could also be adapted for other interesting algorithms. 

Many machine learning algorithms have been analyzed by smoothed analysis, such as k-means clustering 
(Arthur et al. 2009) and Support Vector Machines (Blum and Dunagan 2002; Spielman and Teng 2009); however, 

[Figure 1: flowchart. Given input list length N and the input lists List_1, List_2, List_3, ...: Step 1, generate N! perturbed groups; Step 2, sort the lists in all groups (algorithm A) and return runtimes; then average the runtimes of each group (AvgRuntime_1, ..., AvgRuntime_N!) and find the maximum, which is the smoothed complexity.]

Fig. 1: Steps for computing the SC (maximum average runtime) of sorting algorithms empirically. 


to the best of our knowledge, there is no previous research on using machine learning algorithms to predict the 
SC of sorting algorithms. Due to recent modular smoothed analysis results, we can compute the SC of Quicksort 
exactly, for medium size inputs (e.g., N < 3000). For other sorting algorithms, currently the SC can only be 
computed using its original definition, and thus, due to computational requirements, only for small input sizes 
(e.g., N < 100), limiting our understanding of the general SC behaviour. This also means that we are faced with 
a lack of ground truth data to train machine learning algorithms for predicting the SC. To this end, in this work, 
we use a combination of modular smoothed analysis and empirical approaches to generate ground truth data. We 
formulate predicting the numeric value of SC of sorting algorithms as a regression problem. The techniques to 
handle regression, curve fitting and surface fitting problems in the machine learning area are quite mature (Hastie 
et al. 2013). The gist of this work is how to gather ground truth data, identify good features and build appropriate 
learning models, so that by training on behaviour data of sorting algorithms on small inputs (where it is relatively 
easy to gather ground truth data), we can accurately predict the SC of the sorting algorithm for large inputs. 


2 Discrete Smoothed Complexity 


While worst-case complexity refers to the maximum running time of an algorithm acting on every input, the SC is 
a smoothed version of worst-case complexity, that considers the maximum average running time of an algorithm 
acting on the perturbations of every input. The degree of perturbation is measured by a parameter $\sigma$. SC can be 
explained as a function of $\sigma$ which interpolates between the worst-case and average-case running times. When $\sigma$ 
goes to 0, the SC is equal to the worst-case complexity, whereas if $\sigma$ goes to 1, the SC becomes the 
average-case complexity. In practice, it is useful to understand how quickly the SC switches from worst to 
average case. The quicker the SC switches, the less likely the worst case is to appear in practice. 

SC was originally defined for continuous cases using Gaussian perturbations (Spielman and Teng 2001). The 
discrete version of SC was introduced in (Banderier et al. 2003) and extended in (Schellekens et al. 2014). In 
this work, we use the partial permutation perturbation definition of (Schellekens et al. 2014), where the degree of 
perturbation $\sigma$ is given by K ($0 \le K \le N$) and N is the length of the input list: 

Definition 1 A $\sigma$-partial permutation of S is a random sequence $S' = (S'_1, S'_2, \ldots, S'_N)$ obtained from $S = (S_1, S_2, \ldots, S_N)$ 
in two steps. 

1. K elements of S are selected at random. 

2. Choose one of the K! permutations of these elements (uniformly at random) and rearrange them in that order, 
leaving the positions of all the other elements fixed. 

Table 1 shows the partial permutations for N = 3 on the set of permutations of {1,2,3}. When K = 0, the list 
is not perturbed. When K = 1, the perturbed group contains the list itself, repeated N times. When K = N, the 
perturbed group is the set of permutations of size N, $\Sigma_N$. 


Table 1: The partial permutations for N = 3 on the set of permutations of {1,2,3}. 

| K=0 | K=1 | K=2 | K=3 |
|---|---|---|---|
| 123 | 123, 123, 123 | 123, 123, 123, 132, 213, 321 | 123, 132, 213, 231, 312, 321 |
| 132 | 132, 132, 132 | 132, 132, 132, 123, 312, 231 | 132, 123, 312, 321, 213, 231 |
| 213 | 213, 213, 213 | 213, 213, 213, 231, 123, 312 | 213, 231, 123, 132, 312, 321 |
| 231 | 231, 231, 231 | 231, 231, 231, 213, 321, 132 | 231, 213, 321, 312, 123, 132 |
| 312 | 312, 312, 312 | 312, 312, 312, 321, 132, 213 | 312, 321, 132, 123, 213, 231 |
| 321 | 321, 321, 321 | 321, 321, 321, 312, 231, 123 | 321, 312, 231, 213, 123, 132 |
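
The two steps of Definition 1 can be enumerated exhaustively for small lists. The following is a minimal Python sketch, not the authors' code; the function name `perturbed_group` is an illustrative assumption. It lists every choice of K positions and every ordering of the selected elements, so repeated lists appear with their multiplicity, as in Table 1.

```python
from itertools import combinations, permutations


def perturbed_group(s, k):
    """Return every K-partial permutation of s, with multiplicity.

    Hypothetical helper: it enumerates all outcomes of the two random
    steps of Definition 1 instead of sampling one of them.
    """
    s = tuple(s)
    group = []
    # Step 1: every way of selecting K positions of s.
    for positions in combinations(range(len(s)), k):
        selected = [s[i] for i in positions]
        # Step 2: every ordering of the selected elements, written back
        # into the same positions; all other elements stay fixed.
        for ordering in permutations(selected):
            perturbed = list(s)
            for pos, value in zip(positions, ordering):
                perturbed[pos] = value
            group.append(tuple(perturbed))
    return group


if __name__ == "__main__":
    for k in range(4):
        print(k, sorted(perturbed_group((1, 2, 3), k)))
```

For the list 123 and K = 2 this yields three copies of 123 plus 132, 213 and 321, which agrees with the first row of Table 1.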


Having the partial permutation perturbation definition, we now give the definition of the discrete SC. 

Definition 2 (Schellekens et al. 2014) Given a problem P with input sequence domain $\Sigma_N$, let A be an algorithm 
for solving P. Let $\overline{T}_A(\mathcal{I})$ be the average running time T of the algorithm A on an input collection $\mathcal{I} \subseteq \Sigma_N$. The 
SC, $T^S_A(N,K)$, of the algorithm A is defined by: 

$$T^S_A(N,K) = \max_{S \in \Sigma_N} \overline{T}_A(Pert_{K,N}(S)) \qquad (2)$$

where $Pert_{K,N}(S)$ is the perturbed group of S under partial permutations, the degree of which is defined by K. 

In this work, the algorithms are comparison-based, therefore $T_A(x)$ is measured as the number of comparisons 
A performs when computing the output on input x. 
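
For very small N, Definition 2 can be evaluated by brute force. The sketch below is an illustration under stated assumptions rather than the authors' implementation: `quicksort_comparisons` is a hypothetical first-element-pivot Quicksort that charges (length - 1) comparisons per partitioning step, `perturbed_group` enumerates Definition 1, and `smoothed_complexity` takes the maximum group average.

```python
from itertools import combinations, permutations


def quicksort_comparisons(lst):
    """Comparisons of a first-element-pivot Quicksort (assumed cost model:
    len(lst) - 1 comparisons per partitioning step, distinct elements)."""
    if len(lst) <= 1:
        return 0
    pivot, rest = lst[0], lst[1:]
    smaller = [x for x in rest if x < pivot]
    larger = [x for x in rest if x > pivot]
    return len(rest) + quicksort_comparisons(smaller) + quicksort_comparisons(larger)


def perturbed_group(s, k):
    """All K-partial permutations of s, with multiplicity (Definition 1)."""
    group = []
    for positions in combinations(range(len(s)), k):
        for ordering in permutations([s[i] for i in positions]):
            perturbed = list(s)
            for pos, value in zip(positions, ordering):
                perturbed[pos] = value
            group.append(tuple(perturbed))
    return group


def smoothed_complexity(n, k, cost=quicksort_comparisons):
    """Maximum over all inputs of the average cost of the perturbed group."""
    # Cache the cost of each of the N! permutations once.
    cache = {p: cost(list(p)) for p in permutations(range(1, n + 1))}
    best = 0.0
    for s in cache:
        group = perturbed_group(s, k)
        average = sum(cache[p] for p in group) / len(group)
        best = max(best, average)
    return best


if __name__ == "__main__":
    for k in range(6):
        print(k, smoothed_complexity(5, k))
```

With K = 0 the value is the worst case of the assumed Quicksort, and with K = N it is its average over all permutations. Anything beyond N of about 8 is already slow, which is the infeasibility the text describes.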


2.1 Modular Smoothed Analysis 


Modular smoothed analysis was recently introduced in (Schellekens et al. 2014; Hennessy and Schellekens 
2014). It is a simplification of traditional SC analysis by smoothing out the perturbations over the computation. 
For algorithms that are random bag preserving (Schellekens 2008), the number of comparisons of the algorithm 
running on an input can be captured and calculated through a recurrence equation. The two equations for the 
modular SC of Quicksort and M3Quicksort are shown below (Equations 3 and 4). Using these two formulas we collect 
ground truth data for Quicksort and its median-of-three variant for training and evaluating our supervised learning 
method. The modular recurrence equation for the SC of Quicksort, f(N, K), is (Schellekens et al. 2014): 


f{N,K) = {N-\) + f^ pN+i-jfU - 1,^) + L PffU -hK), 


( 3 ) 


7=1 


7=1 


where 


N-K+l 
and 




(^-1) 

N{N-\)' 


The recurrence equation for the SC of M3Quicksort is (Hennessy and Schellekens 2014): 


/(;v, ^:) = (iv - 1 ) +1 + £* -1, ^:) + “ ‘/(i -1, (4) 

7=2 7=2 

where 


jiv-i 210--1) (JN-A\ {K+\) (N-A\ 3{N-j) (N-A\\ 

^ K{N-2)Q \ \K-2)^ {K-\)\K-3)^ {K-\) \K-A)y 


2<j<N-2, 


and 

K:! = {^'(x- l)2'(X-2)!(2X- 1) 

Modular SC values are closer to the traditional SC ones than the existing mathematical bounds are 
(Schellekens et al. 2014). 


3 Sorting Algorithms 


This paper focuses on analysing and predicting the SC of four sorting algorithms: Quicksort, M3 Quicksort, 
optimized Bubblesort and Mergesort. Their worst-case, average-case and smoothed complexity are listed in Table 2. 


Table 2: The worst-case, average-case, and smoothed complexity (if known) of Quicksort, M3 Quicksort, 
optimized Bubblesort and Mergesort. 

| Sorting Algorithm | Worst-Case | Average-Case | Smoothed Complexity |
|---|---|---|---|
| Quicksort | $O(N^2)$ | $O(N \log N)$ | $O(\frac{N}{p} \log(N))$ |
| M3 Quicksort | $O(N^2)$ | $O(N \log N)$ | NA |
| Optimized Bubblesort | $O(N^2)$ | $O(N^2)$ | NA |
| Mergesort | $O(N \log N)$ | $O(N \log N)$ | NA |


M3 Quicksort is a variation of Quicksort. The classical version of Quicksort selects the first element of the list 
as a pivot, while M3 Quicksort first compares the first, the median and the last element, and selects as pivot the 
element whose value is the median of the three. By doing so, M3 Quicksort is 30% - 50% faster than the original 
algorithm. 

Bubblesort is simple to implement, and it is also easy to track its comparisons. The normal Bubblesort performs 
the same number of comparisons on all inputs of a given length, so it is not desirable as a testing algorithm for the SC, 
which analyzes the transition between the worst-case and the average-case complexity. Therefore, we focus here on 
optimized Bubblesort (Knuth 1973). A tracker is added in Bubblesort to show whether or not elements were swapped, so the algorithm 
can stop running earlier if the list has been sorted. As a result, optimized Bubblesort works faster on part of the 
inputs and its average-case runtime will be smaller than its worst-case runtime, although its two complexities are 
on the same scale. 

Mergesort is the standard version. Note that although its worst-case and average-case complexities are on the 
same scale, its worst-case and average-case runtimes do not have the same value. 


4 Data Collection 

In this section we describe our process of collecting ground truth data for learning prediction models. All our data 
and code is available upon request for research purposes. 


4.1 Empirical Approach 


Although (Banderier et al. 2003) has proven that the SC of Quicksort is $O(\frac{N}{p} \ln N)$, this bound is not accurate 
enough (see (Schellekens et al. 2014) for details) to allow us to train a supervised machine learning approach. 
To obtain better ground truth data for the SC, we first use an experimental approach to compute the SC exactly, 
by following the definitions in (Schellekens et al. 2014). The steps for calculating the SC of a sorting algorithm A for 
input lists of a given length N are (Figure 1): 

Step 1. Generate a perturbed group for each input list, under the partial permutation perturbation. Given the 
perturbation parameter K, and following Definition 1, the size of each perturbed group is 

$$\binom{N}{K} K! = \frac{N!}{K!\,(N-K)!} \, K! = \frac{N!}{(N-K)!} \qquad (5)$$

The total number of permutations in all perturbed groups is equal to the number of input lists, which is N!, 
multiplied by the size of each perturbed group: 

$$N! \cdot \frac{N!}{(N-K)!} = \frac{(N!)^2}{(N-K)!} \qquad (6)$$


Step 2. For all permutations in every perturbed group, compute their runtime¹ under sorting algorithm A. Denote 
the average-case complexity of A as A(N); then the complexity of this process is 

$$\frac{(N!)^2 \cdot A(N)}{(N-K)!} \qquad (7)$$


Step 3. Calculate the average runtime for each perturbed group, then select the maximum average runtime among 
all perturbed groups, i.e., the SC. Compared to Equation 7, the cost of calculating the averages and the 
maximum is negligible. Therefore the complexity of computing the SC of A is the one given in Equation 7. 

We can see how computationally heavy this process is. Even though we can store the runtime of the N! 
permutations in memory to avoid repetitive computation, the complexity of computing the SC is still 

$$O(N! \cdot A(N)) \qquad (8)$$

¹ Runtime is the number of comparisons in our case, as we only consider sorting algorithms in this work. 

For Quicksort, the complexity of computing the SC is $O(N! \, N \log(N))$, while for Bubblesort, the complexity is 
$O(N! \, N^2)$. However, since empirical data is essential as a ground truth for a machine learning approach, we generate 
some of our data with this approach. We compute the SC of Quicksort, M3 Quicksort, optimized Bubblesort, 
and Mergesort. Our code uses hill climbing and Quicksort-specific worst-case permutation results (Shi 2013) to 
push the input size for which we can empirically obtain SC values. Plots of the results of the four algorithms are 
shown in the corresponding figure for list length N = 10, $2 \le K \le N$. 
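
To make the growth in Equations 5, 6 and 8 concrete, the counts can be tabulated directly. This is a small illustrative Python sketch; the function names are assumptions introduced here, and only the factorial counts from the equations are used.

```python
from math import factorial


def group_size(n, k):
    """Size of one perturbed group, N! / (N - K)!  (Equation 5)."""
    return factorial(n) // factorial(n - k)


def total_perturbed_lists(n, k):
    """Number of lists over all N! perturbed groups (Equation 6)."""
    return factorial(n) * group_size(n, k)


def distinct_lists_to_sort(n):
    """With cached runtimes only the N! distinct permutations are sorted,
    which is the factor N! in Equation 8."""
    return factorial(n)


if __name__ == "__main__":
    for n in (3, 5, 10, 15):
        print(n, group_size(n, n), total_perturbed_lists(n, n), distinct_lists_to_sort(n))
```

For N = 3 and K = 2 each group has 6 lists and there are 36 lists in total, matching the K=2 column of Table 1.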

Limitations of the Empirical Approach: Through the experimental approach, we can obtain the SC of sorting 
algorithms only for very limited N and K, due to the computational burden. For K = 2, we can compute the SC 
for Quicksort only up to N = 200, and for K = N, only up to N = 10. For algorithms that are less efficient than 
Quicksort, the data we can collect this way is much less. Such a dataset alone is too small to be useful for training 
and testing our learning system. Therefore, in the next section, we show how to use modular analysis to gather 
more data. 


4.2 Modular Smoothed Analysis 

Modular smoothed analysis (Schellekens et al. 2014) provides another way to calculate the SC for Quicksort and 
its median-of-three variant. More precisely, it gives recursive formulas that are parameterized by the list length N 
and perturbation parameter K. Using the same amount of time as the empirical approach, we can collect a thousand 
times more data through the modular approach. For Quicksort, the maximum list length N for which the SC can 
be computed is 3000. For M3Quicksort, we can collect the SC for N up to 130. Unfortunately, when N > 130, 
the factorial calculation within the formula makes the computation unmanageable. In addition, the formula only 
works when $K \ge 4$, so the SC for K = 2 and K = 3 cannot be obtained through the modular approach. 

Limitations of the Modular Approach: Due to the recursive structure of the modular analysis equations, this 
approach also quickly becomes infeasible with increasing input size N. Additionally, for now, modular smoothed analysis 
results are known only for Quicksort and M3 Quicksort, while we require ground truth data to validate learning 
models for more sorting algorithms, e.g., Bubblesort and Mergesort. 


5 Data Analysis 

Section 4 discussed the collection of the ground truth data for building SC prediction algorithms, i.e., given 
an input list of length N and a perturbation parameter K, we have computed the value of SC for four sorting 
algorithms. For Quicksort and M3 Quicksort, we used the modular approach, and for Bubblesort and Mergesort, 
the empirical approach. Table 3 shows sample ground truth data for Quicksort. The accompanying figure shows the relationships 
between N, K and the SC of Quicksort. If the value of N increases, no matter what value K is, the SC increases as 
well. This is reasonable, since the execution time will be longer in general for sorting algorithms when the input 
list length is greater. If the value of K increases, no matter what value N is, the SC decreases. Because the SC is 
a hybrid of worst-case and average-case analysis, if K increases, SC will tend from the worst-case towards the 
average-case behavior (Spielman and Teng 2001). From the corresponding figures we can see that similar patterns also exist 
for the SC of M3Quicksort, Bubblesort and Mergesort. 


5.1 Fixed N 

By fixing the value of N, we consider the relationship between the SC and K only. Figure 3 shows how the SC 
of Quicksort decreases while K increases, for N = 10, 100, 500, 1500, $2 \le K \le N$. When K = 2, the value of the 
SC is very close to the worst-case complexity, and when K = N, the value is equal to the average-case complexity 
(Spielman and Teng 2001). Note that the larger the value of N, the quicker the SC decreases while K increases. 

Table 3: Sample ground truth data for the SC of Quicksort. 

| N | K | SC |
|---|---|---|
| 10 | 2 | 39.7305 |
| 10 | 3 | 35.6077 |
| 10 | 4 | 32.4413 |
| 10 | ... | ... |
| 10 | 10 | 24.4373 |
| 15 | 2 | 91.8248 |
| 15 | 3 | 81.6957 |
| 15 | 4 | 73.8442 |


By looking at Figure 3, where N = 500, we can clearly see the tipping point where the SC steadily turns to 
average-case complexity. Such a point indicates how likely the algorithm is to encounter a worst case in practice, and 
the earlier the tipping point appears, the better the SC of the algorithm. The data shape of the SC depends on 
the sorting algorithm, as shown in Figure 4. The relationships between the SC and K of Quicksort, optimized 
Bubblesort and Mergesort are completely different. While K increases, the SC of Quicksort decreases quickest, 
the second is Mergesort, and the last is Bubblesort. This explains why Quicksort performs very well in practice 
although its worst-case complexity is $O(N^2)$, and also shows how the SC encodes this behaviour. 


5.2 Fixed K 

With K fixed, we analyze the relationship between the SC and N. Figure 5 shows how the SC of Quicksort 
increases when K = 2, and when K = N. The range of N is from 5 to 100. The value of the SC given K = 2 links 
to the worst-case behavior of the algorithm, which is $O(N^2)$ for Quicksort, and similarly, given K = N, it links to 
the average-case behavior, which is $O(N \log N)$. 
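
The two limiting behaviours can be checked against closed forms. The sketch below assumes the classical Quicksort cost model with N - 1 comparisons per partitioning step, for which the worst case is N(N-1)/2 and the average over all permutations is 2(N+1)H_N - 4N, with H_N the N-th harmonic number. That the paper's data follow this exact cost model is an assumption; the function names are hypothetical.

```python
from fractions import Fraction


def quicksort_worst_case(n):
    """Worst-case comparisons, N(N-1)/2, under the assumed cost model."""
    return n * (n - 1) // 2


def quicksort_average_case(n):
    """Average comparisons over all permutations, 2(N+1)H_N - 4N."""
    harmonic = sum(Fraction(1, i) for i in range(1, n + 1))
    return float(2 * (n + 1) * harmonic - 4 * n)


if __name__ == "__main__":
    for n in (10, 15, 100):
        print(n, quicksort_worst_case(n), round(quicksort_average_case(n), 4))
```

For N = 10 the average evaluates to about 24.4373, which is the value listed for N = 10, K = 10 in the sample ground truth data, and the worst case is 45, slightly above the K = 2 value of 39.7305.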


5.3 Feature Selection 

Due to the definition of SC, when selecting features for designing our learning models, N and K are the first 
choice. N is the input list length and K is the perturbation parameter varying from 2 to N. In our experiments, 
we also found Runtime-based features to be helpful for predicting the SC. Runtime is the number of comparisons a 
sorting algorithm performs on an input list, and MaxRuntime is the maximum Runtime among all input lists, given N. The 
average time complexity for computing Runtime of Quicksort over all inputs is $O(N \log N)$. We define AvgRuntime 
as the average Runtime of an input's perturbed group in a sorting algorithm. Note that the SC is the 
maximum AvgRuntime (MaxAvgRuntime) that can be found among all inputs. We have found that Runtime and 
AvgRuntime do not improve prediction, but MaxRuntime does. For each N, there is only one MaxRuntime 
value. MaxRuntime is fairly easy to compute compared to the SC, and it can act as a scaling factor to indicate an 
appropriate SC value for the model. For Quicksort we compute MaxRuntime using the worst-case permutation. 
For the other 3 algorithms, we use hill climbing. Table 4 shows the definition of important terms for this section. 
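
The text does not specify how the hill climbing is set up, so the following Python sketch is only one plausible reading. The neighbourhood (swapping two positions), the number of restarts and steps, the fixed seed and the function names are all assumptions made for illustration; the cost function is an optimized Bubblesort with an early-exit flag.

```python
import random


def bubblesort_comparisons(lst):
    """Comparisons made by Bubblesort with a 'swapped' flag for early exit."""
    a = list(lst)
    n = len(a)
    count = 0
    for i in range(n - 1):
        swapped = False
        for j in range(n - 1 - i):
            count += 1
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
                swapped = True
        if not swapped:
            break  # the list is already sorted
    return count


def hill_climb_max_runtime(n, cost, restarts=5, steps=200, seed=0):
    """Estimate MaxRuntime by local search over permutations of 1..n.

    Assumed scheme: random restarts, swap-two-positions moves, and a move
    is kept whenever it does not decrease the cost.
    """
    rng = random.Random(seed)
    best_cost = -1
    for _ in range(restarts):
        current = list(range(1, n + 1))
        rng.shuffle(current)
        current_cost = cost(current)
        for _ in range(steps):
            i, j = rng.sample(range(n), 2)
            current[i], current[j] = current[j], current[i]
            new_cost = cost(current)
            if new_cost >= current_cost:
                current_cost = new_cost
            else:
                current[i], current[j] = current[j], current[i]  # undo the move
        best_cost = max(best_cost, current_cost)
    return best_cost


if __name__ == "__main__":
    print(hill_climb_max_runtime(8, bubblesort_comparisons))
```

Hill climbing returns a lower bound on the true maximum, since it can stop in a local optimum; more restarts make that less likely at a higher cost.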


6 Prediction Models for the SC 

In this section we present our evaluation methodology and our two approaches for predicting the SC of sorting 
algorithms. 

Fig. 3: The SC of Quicksort where N = 10, 100, 500, 1500, $2 \le K \le N$. 

Fig. 4: Data shape of SC for varying K, for Quicksort, Bubblesort, Mergesort, N = 10. Panels: Quicksort, Bubble Sort, Merge Sort. 

Fig. 5: The SC of Quicksort when fixing K=2 and K=N. Horizontal axis: N. 

Table 4: Summary of important terms/features for our prediction models. 

| Terms | Definition |
|---|---|
| N | Input list length |
| K | Perturbation parameter, varies from 2 to N |
| MaxRuntime | Maximum Runtime among all input lists of size N. For each N, there is only one corresponding MaxRuntime |
| MaxAvgRuntime | Maximum AvgRuntime among all input lists of size N, given K. For each N and K combination, there is only one corresponding MaxAvgRuntime, aka the SC |


6.1 Evaluation Metrics 


We use three classical measures to evaluate the prediction quality of the models tested. Assume the size of the test 
set is n; the actual target attribute values in the test set are $a_1, a_2, \ldots, a_n$; the predicted values on the test instances 
are $p_1, p_2, \ldots, p_n$. The Mean Absolute Error (MAE) is (Witten et al. 2011): 

$$MAE = \frac{|p_1 - a_1| + \cdots + |p_n - a_n|}{n} \qquad (9)$$

When the relative rather than the absolute error values are more important, we use Mean Absolute Percentage 
Error (MAPE): 

$$MAPE = \frac{\left|\frac{p_1 - a_1}{a_1}\right| + \left|\frac{p_2 - a_2}{a_2}\right| + \cdots + \left|\frac{p_n - a_n}{a_n}\right|}{n} \times 100\% \qquad (10)$$

We also show the Root Mean Squared Error (RMSE), which is more sensitive to outliers (Witten et al. 2011): 

$$RMSE = \sqrt{\frac{(p_1 - a_1)^2 + \cdots + (p_n - a_n)^2}{n}} \qquad (11)$$
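
The three measures translate directly into code. The paper's own analysis was done in R; the Python functions below are a small illustrative sketch with assumed names, taking the predicted values first and the actual values second.

```python
from math import sqrt


def mae(predicted, actual):
    """Mean Absolute Error (Equation 9)."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def mape(predicted, actual):
    """Mean Absolute Percentage Error in percent (Equation 10).
    Assumes no actual value is zero."""
    return 100.0 * sum(abs((p - a) / a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root Mean Squared Error (Equation 11)."""
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


if __name__ == "__main__":
    actual = [10.0, 20.0]
    predicted = [12.0, 18.0]
    print(mae(predicted, actual), mape(predicted, actual), rmse(predicted, actual))
```

In this toy case the absolute errors are both 2, but the relative errors are 20% and 10%, which is why MAPE is the more informative measure when SC values span several orders of magnitude.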

Fig. 6: The value of feature $\frac{1}{\sqrt{K}} \times MaxRuntime$ in TLR-SC, when $2 \le K \le N$, N = 15, 20, 25, 30. Horizontal axis: K. 


We use the R language environment for analysing our data and building predictive models, since it offers more 
flexibility in manipulating/visualising data. All our data and R code are available upon request for research purposes. 


6.2 Model TLR-SC 


In order to predict the SC we first tested several feature combinations (e.g., N, K, Runtime, MaxRuntime) 
and several built-in regression algorithms of the open source machine learning software WEKA (Witten et al. 
2011). Nevertheless, straightforward application of WEKA algorithms did not work well for predicting the SC, in 
particular for the scenario we are interested in: training on small input sizes and predicting/testing on large input 
sizes. Most WEKA algorithms delivered MAPE around 20% when trained on data with $N \le 20$ and tested on 
data with $N \ge 40$. 

In this section, we discuss several modelling approaches, and propose a first model for accurately predicting 
the SC, named TLR-SC (Transformed Linear Regression for Smoothed Complexity). The idea behind TLR-SC is 
to build new features that better capture the nature of the relationship between the SC and input data characteristics. 
For example, as shown earlier for Quicksort, the SC has a nonlinear relationship with K and N. In order 
to capture this relationship, we create a new feature that directly couples K and MaxRuntime (which is also 
influenced by N). We empirically found that $\frac{1}{\sqrt{K}} \times MaxRuntime$ captures this non-linear relationship best. Using 
the new feature and linear regression (the lm function in R), MAPE reduces from 20% to 4.58%. Figure 6 shows the 
value of feature $\frac{1}{\sqrt{K}} \times MaxRuntime$, given $2 \le K \le N$, N = 10, 15, 20, 25, 30. 

Note that this feature is customized for Quicksort. As we showed earlier (Figure 4), different algorithms have 
different shapes of SC, thus feature $\frac{1}{\sqrt{K}} \times MaxRuntime$ is not suitable for other algorithms. Therefore, in order 
to build models for predicting the SC of other algorithms, we require a more general approach for capturing the 
relationship between SC and the features. 

6.2.1 Optimization of the Proposed Feature 

The new feature, expressed as $K^{-0.5} \times MaxRuntime$, generally captures the shape of how SC decreases while 
K increases. We can optimize this result by parameterizing it. Using two parameters a and b, with initial values 
a = 0, b = -0.5, the new feature becomes 

$$(K + a)^b \times MaxRuntime \qquad (12)$$

We vary the value of parameters a and b to find the best combination. We have found that for lm (with features 
$(K + a)^b \times MaxRuntime$, N and K) trained on data with $10 \le N \le 20$, $2 \le K \le N$ and tested on data with 
larger N, the best combination is a = 2.2, b = -0.68, with Mean Absolute Error (MAE) of 7.04. 

Adding more parameters can improve the accuracy of our model, but may result in overfitting. Overfitting 
refers to fitting the training set very well, while failing to generalize to new test data. To check against this, we 
test the parameter combination on more test sets, where N = 40, 60, 80, $2 \le K \le N$, while the training set remains 
$10 \le N \le 20$, $2 \le K \le N$. The overall results show that the following parameter combination works well on all 
test sets: a = 2.2, b = -0.1, and the final feature used in lm is 

$$(K + 2.2)^{b} \times MaxRuntime \qquad (13)$$
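
The parameter search described above can be sketched as a grid search around an ordinary least-squares fit. The paper uses R's lm; the pure-Python version below is only an illustration of the idea, and every function name, the grid values and the synthetic data generator are assumptions introduced here. The synthetic rows are generated from a known linear model so that the search has a verifiable answer; they are not the paper's ground truth data.

```python
def solve_linear_system(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_least_squares(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    m = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(m)]
    return solve_linear_system(xtx, xty)


def features(n, k, max_runtime, a, b):
    """Intercept, the transformed feature of Equation 12, N and K."""
    return [1.0, (k + a) ** b * max_runtime, float(n), float(k)]


def predict(weights, row):
    return sum(w * x for w, x in zip(weights, row))


def grid_search(train, valid, a_values, b_values):
    """Return (validation MAE, a, b, weights) of the best (a, b) pair.
    train and valid are lists of (N, K, MaxRuntime, SC) tuples."""
    best = None
    for a in a_values:
        for b in b_values:
            w = fit_least_squares([features(n, k, m, a, b) for n, k, m, _ in train],
                                  [sc for _, _, _, sc in train])
            err = sum(abs(predict(w, features(n, k, m, a, b)) - sc)
                      for n, k, m, sc in valid) / len(valid)
            if best is None or err < best[0]:
                best = (err, a, b, w)
    return best


def make_synthetic_rows(n_values, a=2.2, b=-0.68):
    """Synthetic demo data only: SC is an exact linear function of the features."""
    rows = []
    for n in n_values:
        m = n * (n - 1) // 2  # Quicksort worst case, used as MaxRuntime
        for k in range(2, n + 1):
            rows.append((n, k, m, 3.0 + 0.8 * (k + a) ** b * m - 0.1 * n + 0.2 * k))
    return rows


if __name__ == "__main__":
    train = make_synthetic_rows([10, 15, 20])
    valid = make_synthetic_rows([40])
    err, a, b, _ = grid_search(train, valid, [0.0, 2.2], [-0.5, -0.68, -0.8])
    print(a, b, err)
```

Validating on a larger N than the training set mirrors the train-small, test-large scenario of the text; with real ground truth data the validation error would of course not vanish as it does on this synthetic example.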


6.2.2 Results 

We work with two ground truth datasets for training and testing models for Quicksort. The first set contains data 
of $10 \le N \le 100$, $2 \le K \le N$, where N increases in steps of 5. The second set contains data of $100 \le N \le 500$, $2 \le K \le N$, 
where N increases in steps of 100. Table 5 shows some sample data from the first dataset. We denote by $train_{a-b}$ a training 


Table 5: Sample ground truth data for Quicksort used in TLR-SC. 

| N | K | MaxRuntime | SC |
|---|---|---|---|
| 10 | 2 | 45 | 39.7305 |
| 10 | 3 | 45 | 35.6077 |
| 10 | 4 | 45 | 32.4413 |
| 10 | 10 | 45 | 24.4373 |
| 15 | 2 | 105 | 91.8248 |
| 15 | 3 | 105 | 81.6957 |
| 15 | 4 | 105 | 73.8442 |


set with $a \le N \le b$, $2 \le K \le N$, by $test_{a-b}$ a test set with $a \le N \le b$, $2 \le K \le N$, and by $lm_{a-b}$ an lm model trained on 
$train_{a-b}$. These notations are listed in Table 6. Figure 7 shows the predicted results of $lm_{10-20}$ on $test_{40-40}$, the 
values of data in $train_{10-20}$, and the true value of the SC of Quicksort in $test_{40-40}$. 

Table 7 shows the MAE and MAPE of lm trained on different training sets and tested on various test sets. 
Generally speaking, the larger the training set, the better the test results, and the greater the N values in the test set, 
the worse the prediction accuracy. For test sets with small N values (e.g., $N \le 100$), MAPE is around 3%. However, 
when tested on greater N values (e.g., N > 200), MAPE > 10%. 

Table 6: Description of notation for TLR-SC. 

| Term | Description |
|---|---|
| $train_{a-b}$ | A training set of $a \le N \le b$, $2 \le K \le N$ |
| $lm_{a-b}$ | An lm model trained on $train_{a-b}$ |
| $test_{a-b}$ | A test set of $a \le N \le b$, $2 \le K \le N$ |


Table 7: MAE and MAPE of the predicted results of TLR-SC. 

| Model | Error | $test_{40-50}$ | $test_{60-70}$ | $test_{80-100}$ | $test_{200-200}$ | $test_{300-300}$ | $test_{500-500}$ |
|---|---|---|---|---|---|---|---|
| $lm_{10-20}$ | MAE | 7.85 | 17.58 | 38.83 | 183.02 | 422.28 | 1175.83 |
| | MAPE | 2.56% | 3.42% | 4.41% | 7.26% | 9.61% | 13.66% |
| $lm_{10-40}$ | MAE | 4.26 | 12.49 | 34.22 | 200.44 | 483.83 | 1377.71 |
| | MAPE | 1.33% | 2.44% | 3.88% | 7.85% | 10.91% | 15.99% |
| $lm_{10-60}$ | MAE | 2.73 | 8.53 | 28.80 | 200.08 | 497.88 | 1442.20 |
| | MAPE | 0.67% | 1.56% | 3.15% | 7.54% | 10.87% | 16.38% |
| $lm_{10-80}$ | MAE | 5.26 | 4.98 | 20.89 | 185.03 | 480.79 | 1432.44 |
| | MAPE | 1.47% | 0.71% | 2.23% | 6.n% | 10.24% | 15.98% |


Fig. 7: The predicted results of TLR-SC, trained on data of N = 10, 15, 20, $2 \le K \le N$, tested on data of N = 40, $2 \le K \le N$. Plot title: Predicted Results of TLR-SC. 

6.2.3 Discussion 

The main idea behind TLR-SC is to transform a nonlinear regression problem into a linear one by manipulating the data
and features. The advantage is that a simple algorithm, linear regression, can handle a complex data shape in one
step, which keeps the TLR-SC model simple. This model shows the importance of analyzing the data and understanding
the relationship between the features and the prediction target. No matter how powerful the learning algorithm
(e.g., the WEKA algorithms), it cannot solve the problem automatically without human knowledge.

Although the TLR-SC results are very encouraging, the accuracy drops quickly when the N value of the test set
is large. One reason might be that the training data is insufficient, so the model cannot show its full ability in
learning and predicting. It is difficult to expect it to perform well on test sets with N > 300 when we only train it
on data of N ≤ 20. Besides, transforming the data and features to create a linear relationship may distort the
prediction error of linear regression (Motulsky 2004). Another reason is that other functions may work better than
our choice. Most importantly, as discussed earlier, the relationship between SC and K changes with N: the greater N
is, the quicker SC decreases when K increases. Unfortunately, our customized feature does not capture this well.
Although the value of the feature ((K + a)^b) × MaxRuntime changes with N, the parameters a and b do not. However,
parameter b is the most important one in fitting the curve, and if we can make b change based on N, the performance
of TLR-SC should improve. In the next section, we present a new model that implements this observation.
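
The transformation behind TLR-SC can be illustrated with a minimal sketch: once the customized feature ((K + a)^b) × MaxRuntime is computed, an ordinary straight-line fit is enough. Everything below is an assumption made for illustration only: `max_runtime` is taken to be the worst-case comparison count N(N−1)/2, the exponent b = −0.5 is an invented value, and the "SC" values are synthetic (generated to be exactly linear in the feature), not data from our experiments.

```python
def max_runtime(n):
    # Assumption: MaxRuntime is modelled as the worst-case comparison
    # count of Quicksort on a list of length n, i.e. n(n-1)/2.
    return n * (n - 1) / 2.0


def tlr_feature(n, k, a=2.2, b=-0.5):
    # Customized feature ((K + a)^b) * MaxRuntime.
    # b = -0.5 is an illustrative value, not a fitted exponent.
    return ((k + a) ** b) * max_runtime(n)


def fit_line(xs, ys):
    # Closed-form ordinary least squares for y = slope * x + intercept.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic training grid: N = 10, 15, 20 and 2 <= K <= N.
train = [(n, k) for n in (10, 15, 20) for k in range(2, n + 1)]
xs = [tlr_feature(n, k) for n, k in train]
ys = [0.9 * x + 3.0 for x in xs]  # made-up linear target, for illustration

slope, intercept = fit_line(xs, ys)

# Extrapolate to a point outside the training range (N = 40, K = 6).
pred_40_6 = slope * tlr_feature(40, 6) + intercept
```

Because a and b are fixed inside `tlr_feature`, the same curve shape in K is reused for every N, which is exactly the limitation discussed above.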

Another disadvantage of TLR-SC is that it is hard to transfer to other algorithms. The feature ((K + 2.2)^b) ×
MaxRuntime, with its fixed exponent b, was created to fit the data shape of Quicksort. Other algorithms may have a
totally different shape, so the model may not be applicable to them by solely changing the parameter values. The
biggest problem with the transition is that, for other algorithms, we lack the amount of data needed to analyze
the data shape, so it is difficult to find the right function shape.


6.3 Model NLR-SC 

In this section, we propose a new model for predicting the SC, NLR-SC (Non-linear Regression for Smoothed
Complexity). This is an updated model, aimed at solving the problems of TLR-SC. We previously showed that
by carefully capturing the nonlinear data shape, TLR-SC can dramatically increase the accuracy of prediction.
Nevertheless, the relationship between SC and the perturbation parameter K changes with N, and the fixed
parameters a and b limit the ability of TLR-SC to capture this changing shape, as well as its ability to be
transferred to other sorting algorithms.

To solve these problems, NLR-SC breaks down the surface-fitting problem into multiple curve-fitting problems.
By predicting the SC curve by curve, NLR-SC gradually predicts the whole surface. We deal with three types
of curves in NLR-SC. The first one, the curve of the SC against K for fixed N, is shown in blue in Figure 8. Because
the shape of the relationship between the SC and K changes with N, we divide the surface into curves by N value and
predict these curves one by one. For instance, to predict the SC of 40 ≤ N ≤ 50, 2 ≤ K ≤ N, we first predict the SC of
N = 40, 2 ≤ K ≤ N, then N = 41, 2 ≤ K ≤ N, then N = 42, 2 ≤ K ≤ N, ..., until N = 50, 2 ≤ K ≤ N. For each
N, we re-calculate the parameters in the fitting function, so that we capture the data shape of the SC. For this, we
employ nls, an R function for nonlinear least-squares fitting, which automatically determines the nonlinear
least-squares estimates of the parameters. nls works well when the function shape is decided but the parameters of
the function are uncertain. We refer to nls predicting the first type of curve as submodel NLR-SC-N.

By fixing K to 2 or N, we get two more curves, of the SC against N, shown in red and green in Figure 8. When
we use data of small N and K to predict the SC of large N and K, these two curves are bridges between the
training data and the prediction target, and they define the starting point and the ending point of the first curve.
The reason for fixing K to 2 and N is that, according to the theory of smoothed analysis, when K = 2 the
curve of the SC against N follows the worst-case behaviour, while when K = N the curve follows the average-case
behaviour. Therefore, we can use two lm models to capture these two curves well, and supply NLR-SC-N with
their predictions as a training set. We refer to this part as submodel NLR-SC-K.

Fig. 8: Three curves fitted in NLR-SC. The blue curve is the SC of fixed N and varying K; the red curve is the SC of
varying N and fixed K = 2; the green curve is the SC of varying N and fixed K = N.

NLR-SC is the combination of NLR-SC-N and NLR-SC-K, shown in Figure 9. By fixing either N or K, we
simplify the data shape and make the hidden patterns explicit.

6.3.1 NLR-SC: Fixed N 

We predict the SC of data of fixed N, using nls. The function we fit with nls is

$$ a \left( \frac{K}{N} + c \right)^{b} + d \qquad (14) $$

where a, b, c, d are parameters to be fitted. Equation 14 is inspired by the feature used in TLR-SC:

$$ \left( (K + a)^{b} \right) \times \mathit{MaxRuntime} \qquad (15) $$

Compared to Equation 15, in Equation 14 we replace MaxRuntime by the parameter a to increase the flexibility (the
offset inside the power, called a in Equation 15, is called c in Equation 14, and d is a new additive term). In
addition, we chose K/N instead of K in Equation 14 because the maximum value of K depends on the value of N, and we
need the nls model to work on all N values; therefore it is better to use the proportion of K to N, rather than the
absolute value of K. We ran an experiment to examine the minimum amount of training data that nls needs to deliver
good results. Figure 10 shows how nls performs when trained on data of N = 100, 2 ≤ K ≤ 16 and of N = 100, 2 ≤ K ≤ 6
separately. Green spots in Figure 10 are training data, red spots are test data, and the blue line is the predicted
result.
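
As a rough illustration of fitting a(K/N + c)^b + d to a handful of points, the sketch below substitutes a simple technique for R's nls: for fixed (b, c) the function is linear in (a, d), so we grid-search b and c and solve a and d in closed form. This is not the nls algorithm, and the grid ranges are assumptions chosen for the example. The five training points are the Train_{N=40} values listed in Table 9.

```python
def fit_power_curve(ks, scs, n):
    # Fit sc ~ a * (k/n + c)**b + d by grid search over (b, c).
    # For each (b, c), a and d follow from a closed-form straight-line fit.
    xs = [k / n for k in ks]
    mean_y = sum(scs) / len(scs)
    best = None
    for bi in range(1, 31):            # b = -0.1, -0.2, ..., -3.0 (assumed range)
        b = -bi / 10.0
        for ci in range(1, 101):       # c = 0.01, 0.02, ..., 1.00 (assumed range)
            c = ci / 100.0
            zs = [(x + c) ** b for x in xs]
            mean_z = sum(zs) / len(zs)
            szz = sum((z - mean_z) ** 2 for z in zs)
            szy = sum((z - mean_z) * (y - mean_y) for z, y in zip(zs, scs))
            a = szy / szz
            d = mean_y - a * mean_z
            sse = sum((a * z + d - y) ** 2 for z, y in zip(zs, scs))
            if best is None or sse < best[0]:
                best = (sse, a, b, c, d)
    return best


# Train_{N=40} from Table 9: K = 2, 3, 4, 5 and K = N.
n = 40
train_k = [2, 3, 4, 5, 40]
train_sc = [673.8558, 593.7202, 531.3032, 481.5961, 190.3537]

sse, a, b, c, d = fit_power_curve(train_k, train_sc, n)

# Predict the SC at an unseen K on the same curve (K = 6).
pred_k6 = a * (6 / n + c) ** b + d
```

The value of `pred_k6` can be compared with the true SC of 441.1757 reported in Table 9; since the optimiser differs from nls, the two predictions need not coincide.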

We see from Figure 10 that the performance of nls is already very good with only 5 training examples. This
is likely due to the fact that Equation 14 fits the data well. The main reason why the accuracy decreases when we
reduce the training set is that the algorithm does not get enough information about where the SC will decrease
to, in other words, what the value of the SC is when N = 100, K = N. By adding the data entry of N = 100, K = N
to the training set, the prediction accuracy is greatly improved. As shown in Figure 11, with one extra data entry in
the training set, nls easily finds the perfect fit to the curve.

Fig. 9: The structure of model NLR-SC. The predictions of NLR-SC-K are the training examples of NLR-SC-N.

Fig. 10: Testing the size of the training set that nls needs to deliver good results, for N = 100, 2 ≤ K ≤ 16
(training set size 15) and for N = 100, 2 ≤ K ≤ 6 (training set size 5).

As shown in Figure 12, examining data with N = 20, 100, 200, 500, as long as the training data contains entries
with K = 2 and K = N, nls can find the perfect fit using only 5 training cases. Note that the size of the training
set must be larger than the number of parameters in nls, and more parameters generally means higher prediction
accuracy.

Fig. 11: By adding one important training example, N = 100, K = N, nls can find the perfect fit to the curve with
only 5 training cases.

Although both submodel NLR-SC-N of NLR-SC and model TLR-SC deal with the same nonlinear relationship,
they take two different approaches. TLR-SC solves a surface-fitting problem, using training data of small N
and K to predict the SC for large N and K, while NLR-SC-N solves a curve-fitting problem, using training data
of a specific N and some K values to predict the SC of the same N and other K values. NLR-SC-N can be used
to solve the same surface-fitting problem as TLR-SC, by dividing the surface into multiple curves by N and
solving them one by one. Because a new curve is constructed by nls for each N value, NLR-SC-N can capture
the changing shape between the SC and K more accurately than TLR-SC, without causing overfitting.

The only problem now is how to generate training data for NLR-SC-N. To predict the SC of a specific N value,
nls needs at least 5 training examples with the same N value, for K = 2, K = N and other K values. If N is large,
we cannot collect such training data through an experimental approach. Therefore, we use submodel NLR-SC-K to
predict the required training data for NLR-SC-N.

6.3.2 NLR-SC: Fixed K 

Submodel NLR-SC-K is created to generate the training data for nls in NLR-SC-N. As explained in the previous
section, nls needs two important training data points, with K = 2 and K = N, to determine the starting point
and the ending point of the curve. As shown in Figure 13, when K = 2, the SC of Quicksort increases as
N increases, following the worst-case behaviour O(N^2). Similarly, when K = N, the SC follows the average-case
behaviour O(N log N). Both experimental and theoretical results support this finding.

These two patterns are so clear that we can use two lm models to fit the curves. One model focuses on K = 2:
we use N^2 and N as features in lm, train it on data with small N, and test it on data with big N. Similarly, we fit
another lm model for K = N: we use N × log(N) and N as features. Results are shown in Figure 14 and Figure 15.

Fig. 12: Predicting the SC with 5 training examples, for N = 20, 100, 200, 500, 2 ≤ K ≤ N.


Besides K = 2 and K = N, the SC of K values close to 2 or N can also be predicted. For example, the lm model
which is applicable for K = 2 is also applicable for K = 3, K = 4, and K = 5. Similarly, the lm model which is
applicable for K = N is also applicable for K = N − 1, K = N − 2, and K = N − 3. However, the accuracy of such
predicted results might be compromised a little, and it is better to keep the value of K close to 2 or N. By using
these two lm models, we can generate the necessary training cases for NLR-SC-N.

Fig. 13: The SC of Quicksort when K = 2 and K = N, 5 ≤ N ≤ 100.

Fig. 14: Predicting the worst-case complexity of Quicksort, by using lm that was trained on data of N ≤ 20, K = 2
and tested on data of N > 20, K = 2.

Fig. 15: Predicting the average-case complexity of Quicksort, by using lm that was trained on data of N ≤ 20,
K = N and tested on data of N > 20, K = N.


6.4 Combining NLR-SC-N and NLR-SC-K

NLR-SC is the combination of NLR-SC-N and NLR-SC-K. NLR-SC-N is used to predict the SC of fixed large N,
using nls, and NLR-SC-K is used to produce training data for NLR-SC-N, using lm.

In NLR-SC-N, for each N value we want to predict the SC for, we create a standalone nls model. Figure 16
shows 3 nls models for predicting the SC of N = 40, 45, 50, 6 ≤ K ≤ N − 1. Each nls is trained on 5 training
examples, marked as green stars, and these 3 nls models work one by one to predict the target surface of 40 ≤ N ≤
50, 6 ≤ K ≤ N − 1. In NLR-SC-K, for each K value we want to predict the SC for, we also create a standalone lm
model. Figure 17 shows 5 lm models for predicting the SC of 40 ≤ N ≤ 50, K = 2, 3, 4, 5, N. Each lm is trained
on 3 training examples of N = 10, 15, 20 and the same K value as the prediction target, marked as green stars. These
5 lm models also work one by one to predict the SC of 40 ≤ N ≤ 50, K = 2, 3, 4, 5, N. The most important point
of the combination is that the training examples of NLR-SC-N are the predicted results of NLR-SC-K. The red stars
in Figure 17, representing the predicted results of NLR-SC-K, change to green in Figure 16, indicating that they are
the training examples of NLR-SC-N. In other words, NLR-SC-N is built on the predicted results of NLR-SC-K.

Figure 18 shows the final combination: NLR-SC. When we consider NLR-SC-N and NLR-SC-K together,
the training set of the whole model is data of 10 ≤ N ≤ 20, K = 2, 3, 4, 5, N, and the prediction target is data of
40 ≤ N ≤ 50, 2 ≤ K ≤ N, marked as green and red stars in Figure 18. This demonstrates the ability of NLR-SC
to use training data of small N and K to predict the data of big N and K (as we will show, NLR-SC also works
well for N = 3000). We give the detailed prediction process to explain the combination. As shown in Figure 18,
suppose the training data is the SC of Quicksort for N = 10, 15, 20, 2 ≤ K ≤ N, and the test data is the SC of
Quicksort for N = 40, 45, 50, 2 ≤ K ≤ N. First, we filter the training data into a smaller dataset Train_{K=2}, by
K = 2, N = 10, 15, 20, and also filter the test data into a smaller dataset Test_{K=2}, by K = 2, N = 40, 45, 50.
Table 8 shows the data in Train_{K=2} and Test_{K=2}. Next, we train lm in NLR-SC-K on Train_{K=2}, using features
N^2 and N, and test it on Test_{K=2}. We calculate the prediction error, and split the predicted results by N value
into different training sets for NLR-SC-N: Train_{N=40}, Train_{N=45}, and Train_{N=50}. We repeat this process for
K = 3, 4, 5 and K = N. Note that for K = N, we change the features in lm to N × log(N) and N.

Fig. 16: The range of training and test data of NLR-SC-N.

Fig. 17: The range of training and test data of NLR-SC-K.

Fig. 18: The range of training and test data of NLR-SC, when combining NLR-SC-N and NLR-SC-K.
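
The K = 2 step can be reproduced with a short sketch. It assumes that lm fits an intercept in addition to the two features N^2 and N (the default behaviour of R's lm) and solves the least-squares problem through the normal equations with a small hand-written solver; the helper names are hypothetical. The training values are those of Train_{K=2} in Table 8.

```python
def solve_linear_system(a, b):
    # Gaussian elimination with partial pivoting; a is n x n, b has length n.
    # No special handling of singular systems (sketch only).
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_lm(rows, ys):
    # Least squares via the normal equations (X^T X) beta = X^T y.
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def worst_case_features(n):
    # Intercept (assumed), N^2 and N: the K = 2 features.
    return [1.0, float(n * n), float(n)]


# Train_{K=2} from Table 8.
train_n = [10, 15, 20]
train_sc = [39.7305, 91.8248, 165.3564]

beta = fit_lm([worst_case_features(n) for n in train_n], train_sc)

# Predict the SC for the test lengths N = 40, 45, 50 at K = 2.
predictions = {
    n: sum(b * f for b, f in zip(beta, worst_case_features(n)))
    for n in (40, 45, 50)
}
```

With three training points and three coefficients the fit is exact on the training data, and the extrapolated values agree with the "Pred SC" column of Table 8 to the four decimals shown there. For K = N, `worst_case_features` would be replaced by a function returning the intercept, N × log(N) and N.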


Table 8: The training/test sets and the predicted results of NLR-SC-K, when K = 2.

Train_{K=2}
N     K    SC
10    2    39.7305
15    2    91.8248
20    2    165.3564

Test_{K=2}
N     K    SC
40    2    673.8097
45    2    854.5003
50    2    1056.6209

Predicted Result
True SC      Pred SC      AbsError
673.8097     673.8558     0.0461
854.5003     854.5739     0.0736
1056.6209    1056.7293    0.1084


Now we have 3 training sets, Train_{N=40}, Train_{N=45}, and Train_{N=50}. Each training set contains 5 training
examples. We divide the test data into 3 test sets by N, and refer to them as Test_{N=40}, Test_{N=45}, and
Test_{N=50}. Table 9 shows the data in Train_{N=40} and Test_{N=40}. We train NLR-SC-N on Train_{N=40} and test it on
Test_{N=40}. We repeat this process for the other 2 training and test sets. Prediction errors are calculated for each
step and in total, as shown in Table 10. This example demonstrates how the two submodels are combined, as well as the
ability of NLR-SC to predict the SC for large N using very little training data. MAPE is less than 1%, which is much
lower than that of any model or algorithm that we have studied before. One of the reasons why this model performs
very well is that it takes advantage of the theory of smoothed analysis. Because NLR-SC-K relies on the data patterns
that relate to the worst-case and the average-case behaviour, it leads to very good prediction quality. Therefore
NLR-SC-N, which is trained on the results of NLR-SC-K, also delivers good prediction results.

Table 9: The training/test sets and the predicted results of NLR-SC-N, when N = 40.

Train_{N=40}
N     K     SC
40    2     673.8558
40    3     593.7202
40    4     531.3032
40    5     481.5961
40    40    190.3537

Test_{N=40}
N     K     SC
40    6     441.1757
40    7     408.0416
40    8     380.4325
40    39    191.6724

Predicted Result
True SC     Pred SC     AbsError
441.1757    441.4429    0.2672
408.0416    408.4984    0.4568
380.4325    381.1115    0.6790
191.6724    191.5035    0.1689

Table 10: The MAE, RMSE and MAPE of model NLR-SC, when tested on data of N = 40, 45, 50, 2 ≤ K ≤ N.

           MAE     RMSE    MAPE
N = 40     1.67    1.99    0.71%
N = 45     2.32    2.74    0.83%
N = 50     3.08    3.62    0.94%
Average    2.41    2.92    0.83%
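
For reference, the three error measures can be computed as in the sketch below, which assumes the standard definitions of MAE, RMSE and MAPE. The inputs are only the four Test_{N=40} rows printed in Table 9, so the resulting numbers describe those four rows and are not expected to match the N = 40 line of Table 10, which covers the full test set.

```python
import math


def mae(true, pred):
    # Mean absolute error.
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)


def rmse(true, pred):
    # Root mean squared error.
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true))


def mape(true, pred):
    # Mean absolute percentage error, in percent.
    return 100.0 * sum(abs(t - p) / abs(t) for t, p in zip(true, pred)) / len(true)


# The four rows of Test_{N=40} shown in Table 9 (K = 6, 7, 8, 39).
true_sc = [441.1757, 408.0416, 380.4325, 191.6724]
pred_sc = [441.4429, 408.4984, 381.1115, 191.5035]

errors = {
    "MAE": mae(true_sc, pred_sc),
    "RMSE": rmse(true_sc, pred_sc),
    "MAPE": mape(true_sc, pred_sc),
}
```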


6.5 Other Sorting Algorithms 

The biggest advantage of NLR-SC is that it can be applied to algorithms other than Quicksort. In this work we
also present experiments with NLR-SC for M3 Quicksort, optimized Bubblesort and Mergesort.

We can explain NLR-SC in three steps. First, it predicts the worst-case runtime for a given list length N. Then
it predicts the average-case runtime for the same N. Finally, it predicts how the SC moves from the worst-case
runtime to the average-case runtime. Therefore, as long as we know the worst-case and average-case complexity of an
algorithm and have a small amount of ground truth data, technically, the algorithm’s SC can be predicted by
NLR-SC. Although the data shape of the SC of other algorithms is different from Quicksort’s, NLR-SC can easily
adjust to this difference. All we need to do for the transition is to select the correct lm models to predict the
worst-case and average-case complexity, which is dictated by the theory. For instance, unlike Quicksort, Mergesort’s
worst-case and average-case complexity are both O(N log(N)). Therefore, in NLR-SC-K, we simply replace the
features of the lm models with N × log(N) and N, and the NLR-SC for predicting the SC of Mergesort can then be used
as is.
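
The transfer step amounts to choosing lm features from the known complexity classes. A minimal sketch of such a lookup follows; the table structure and names are hypothetical, the natural logarithm is an assumption (a different base only rescales the fitted coefficient), and the Bubblesort row follows the O(N^2) features described in Section 7.3.

```python
import math

# Which lm features are used for the worst-case (K = 2) and
# average-case (K = N) curves of each algorithm.
LM_FEATURES = {
    "quicksort": {"worst": ("n2", "n"), "average": ("nlogn", "n")},
    "mergesort": {"worst": ("nlogn", "n"), "average": ("nlogn", "n")},
    "bubblesort": {"worst": ("n2", "n"), "average": ("n2", "n")},
}

FEATURE_FUNCS = {
    "n2": lambda n: float(n * n),
    "nlogn": lambda n: n * math.log(n),  # natural log: an assumption
    "n": lambda n: float(n),
}


def lm_features(algorithm, case, n):
    # Feature vector fed to the lm model for the given algorithm and case.
    return [FEATURE_FUNCS[name](n) for name in LM_FEATURES[algorithm][case]]


quick_worst = lm_features("quicksort", "worst", 10)
merge_worst = lm_features("mergesort", "worst", 8)
merge_avg = lm_features("mergesort", "average", 8)
```

Only this lookup changes between algorithms; the nls stage in NLR-SC-N is reused unchanged.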


7 Detailed Results of NLR-SC 

7.1 Quicksort 

We first train and test NLR-SC on Quicksort. We work with three sets of data. The first set contains data of
10 ≤ N ≤ 100, 2 ≤ K ≤ N, where N increases by 5. The second set contains data of 100 ≤ N ≤ 500, 2 ≤ K ≤ N,
where N increases by 100. The last set contains data of 600 ≤ N ≤ 3000, 2 ≤ K ≤ N, where N increases by 300. We
define train_{a-b} as a training set of a ≤ N ≤ b, test_{a-b} as a test set of a ≤ N ≤ b, and nls&lm_{a-b} as an
NLR-SC trained on train_{a-b}. In addition, we define nls&lm^t_{a-b} as the NLR-SC with t lm models in NLR-SC-K,
which means that the nls in NLR-SC-N is trained on t training examples. For all NLR-SC examples we
discussed in previous sections, the t value is 5, i.e., they have 5 lm models in NLR-SC-K, for K = 2, 3, 4, 5, N.
In our experiments we found that the higher the t value, the more accurate the prediction. These terms are also
listed in Table 11.
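
To make the size of such a training set concrete, the sketch below enumerates the (N, K) pairs used by a model with t lm models. The rule for choosing the K values is an assumption extrapolated from the t = 5 case (K = 2, ..., t together with K = N); the function name is hypothetical.

```python
def nlr_sc_k_training_pairs(n_values, t):
    # Assumption: the t lm models cover K = 2, ..., t and K = N,
    # which gives K = 2, 3, 4, 5, N for t = 5 as in the text.
    pairs = []
    for n in n_values:
        for k in list(range(2, t + 1)) + [n]:
            pairs.append((n, k))
    return pairs


# The model trained on N = 10, 15, 20 with t = 5.
pairs = nlr_sc_k_training_pairs([10, 15, 20], 5)
num_examples = len(pairs)
```

For N = 10, 15, 20 and t = 5 this gives 3 × 5 = 15 ground-truth values in total.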


Table 11: Description of terms used in NLR-SC.

Term               Description
train_{a-b}        A training set of a ≤ N ≤ b, 2 ≤ K ≤ N
nls&lm^t_{a-b}     An NLR-SC with t lm models in NLR-SC-K, trained on train_{a-b}
test_{a-b}         A test set of a ≤ N ≤ b, 2 ≤ K ≤ N


Table 12: MAE and MAPE of the predicted results of NLR-SC for Quicksort.

Model               Error   test_{40-50}   test_{90-100}   test_{300-300}   test_{500-500}   test_{1.2k-1.2k}   test_{3k-3k}
nls&lm^5_{10-20}    MAE     2.41           13.93           116.20           257.05           864                2646.65
                    MAPE    0.83%          1.68%           2.99%            3.44%            3.91%              4.04%
nls&lm^5_{10-80}    MAE     2.73           16.65           143.38           321.49           1113.69            3560.26
                    MAPE    0.96%          2.08%           3.83%            4.45%            5.18%              5.50%
nls&lm^?_{?}        MAE     2.87           16.82           150.57           342.33           1209.75            4941.15
                    MAPE    1.03%          2.10%           4.07%            4.82%            5.76%              6.26%
nls&lm^6_{10-20}    MAE     2.15           12.44           103.00           225.93           741.59             2172.66
                    MAPE    0.75%          1.52%           2.70%            3.09%            3.48%              3.52%
nls&lm^?_{10-20}    MAE     1.93           11.24           92.67            201.82           647.85             1879.58
                    MAPE    0.68%          1.39%           2.48%            2.83%            3.15%              3.14%
nls&lm^?_{10-20}    MAE     1.75           10.24           84.26            182.42           582.50             1699.65
                    MAPE    0.62%          1.28%           2.29%            2.61%            2.89%              2.86%


Table 12 shows the MAE and MAPE of NLR-SC trained on different training sets and tested on various test sets.
The first model, nls&lm^5_{10-20}, is trained on data of N = 10, 15, 20, with 5 lm models in NLR-SC-K. The second
model, nls&lm^5_{10-80}, is similar to the first one, but trained on a bigger dataset of 10 ≤ N ≤ 80, and so on. The
nls&lm^6_{10-20} model, on the other hand, has the same training set as the first model, but has 6 lm models in
NLR-SC-K. Based on the test results, we can see that simply increasing the size of the training data for NLR-SC does
not improve the accuracy, but increasing the number of lm models in NLR-SC-K, which also means increasing
the number of training examples of NLR-SC-N, improves the accuracy.

Compared to Table 7, Table 12 shows that NLR-SC has a much better prediction accuracy than TLR-SC. For the
models trained on N ≤ 20, the MAPE of NLR-SC ranges from about 0.6% to about 4% across the test sets. Unlike
TLR-SC, the error does not increase quickly when the N value of the test set increases. Due to the computational
requirements of modular smoothed analysis, the largest test data we can collect through the modular approach is
N = 3000, and NLR-SC performs well on it, with around 4% MAPE. Based on current test results, we expect NLR-SC to
perform well on even larger N, but we currently lack the ground truth data to showcase this.

The most encouraging point of the test results is that we use very few training examples to achieve such
accurate prediction results. For model nls&lm^5_{10-20}, the number of training examples it is trained on is only 15.
This means that if we apply NLR-SC to other algorithms and only use the data collected through an experimental
approach, we can still predict the SC for N greater than 1,000. To the best of our knowledge, for now, only modular
smoothed analysis can achieve that, and only for a restricted class of sorting algorithms (Quicksort and
M3 Quicksort) so far.


7.2 M3 Quicksort 

We also applied NLR-SC to other algorithms. Table 13 shows the MAE and MAPE of the prediction for M3 Quicksort,
for which we collected data through the modular approach. Note that the recurrence equation of modular smoothed
analysis for M3 Quicksort only works for K ≥ 4; therefore, we do not have the SC for K = 2 and K = 3. Because
of the factorial calculation in the formula, the maximum data we can collect is N ≤ 130, 4 ≤ K ≤ N. Predictions
for higher N values cannot be validated due to the lack of ground truth data.

NLR-SC relies on the data of K = 2 and K = 3 to ensure the accuracy of NLR-SC-K, as well as of the whole
model, but the modular approach formula of M3 Quicksort cannot supply them. We have to make NLR-SC work
on data of K = 4, 5, 6, 7 instead, and use Train_{15-25} instead of Train_{10-20}, because when N = 10, K = 7, the SC
is not at all close to the worst-case behaviour, while when N = 15, K = 7, the SC is comparably closer to the
worst-case behaviour. This is also why nls&lm_{40-50} gives a better result than nls&lm_{15-25}: for N = 40, K = 7,
the SC is closer to the worst-case behaviour than for N = 15, K = 7. Besides, similar to the Quicksort results,
adding more lm models in NLR-SC-K (i.e., more training examples in NLR-SC-N) can improve the accuracy.


Table 13: MAE and MAPE of the predicted results of NLR-SC on M3 Quicksort.

Model               Error   test_{40-50}   test_{90-100}
nls&lm^?_{15-25}    MAE     1.69           12.35
                    MAPE    0.72%          1.82%
nls&lm^?_{15-25}    MAE     1.52           12.36
                    MAPE    0.64%          1.82%
nls&lm^?_{40-50}    MAE     1.39           9.06
                    MAPE    0.61%          1.45%
nls&lm^?_{?}        MAE     1.17           8.37
                    MAPE    0.52%          1.34%


7.3 Optimized Bubblesort 

We also test NLR-SC on optimized Bubblesort, for which we collected data through the experimental approach.
The size of the dataset is 50, with 3 ≤ N ≤ 15 and 2 ≤ K ≤ N. Due to the limited computation power, the higher
the N value, the less data we can collect; e.g., for N = 15 we can only collect ground truth data up to K = 4.

The original Bubblesort’s worst-case runtime is equal to its average-case runtime. The optimized Bubblesort
that we used in this paper has better performance on part of the input lists, but its average-case complexity remains
O(N^2), which is different from Quicksort. The good news is that NLR-SC can easily adjust to such a difference.
By simply replacing the feature N × log(N) in the lm models with N^2, NLR-SC is able to capture the SC of algorithms
whose average-case and worst-case complexity are both O(N^2). Considering the limited amount of ground
truth data, we reduce the number of lm models as well as the number of parameters in nls to 4. Thus, NLR-SC
only uses 12 training examples, which follow either worst-case or average-case behaviour.

Fig. 19: Plot of the predicted results of NLR-SC on (a) Bubblesort and (b) Mergesort. The model is trained on
experimental data, which are marked as green spots.

Table 14 shows the error of the predicted results. Due to the shortage of ground truth data, we only test NLR-SC
on test sets of N = 9 and N = 10. The predicted results are fairly good, with an MAE of 0.05 and a MAPE of
0.16% for N = 9, and an MAE of 0.07 and a MAPE of 0.17% for N = 10.


Table 14: MAE and MAPE of the predicted results of NLR-SC on optimized Bubblesort.

Model            Error   Test_{9-9}   Test_{10-10}
nls&lm^4_{?}     MAE     0.05         0.07
                 MAPE    0.16%        0.17%


Figure 19(a) shows the plot of the predicted results. The green spots are the experimentally collected data, while
the red stars are predicted results of NLR-SC whose real values are unknown and cannot be collected
through an experimental approach. We observe from the plot that the predicted results seem to agree with the trend
of the experimental data, but we have no ground truth data to validate them numerically.


7.4 Mergesort 

Similar to Bubblesort, the data of Mergesort is collected through the experimental approach. The total number of
examples is 50, with data of 3 ≤ N ≤ 15 and 2 ≤ K ≤ N. Both the worst-case complexity and the average-case
complexity of Mergesort are O(N log(N)); therefore, we change all features in the lm models to N × log(N). Similar
to Bubblesort, because of the limited ground truth data, we reduce the number of lm models as well as the number
of parameters in nls to 4. Tested on N = 9 and N = 10, the errors of the predicted results of NLR-SC for Mergesort
are shown in Table 15. Figure 19(b) shows the plot of the predicted results. Although the data shape of Mergesort
is different from that of Bubblesort, the SC prediction accuracy is still high. Figure 20 shows the predicted SC of
Quicksort, M3 Quicksort, optimized Bubblesort and Mergesort given by NLR-SC, for N ≤ 60, 2 ≤ K ≤ N.

Table 15: MAE and MAPE of the predicted results of NLR-SC on Mergesort.

Model            Error   Test_{9-9}   Test_{10-10}
nls&lm^4_{?}     MAE     0.05         0.05
                 MAPE    0.20%        0.22%


8 Conclusion 

In this work, we present two regression models for predicting the Smoothed Complexity (denoted as SC) of sorting
algorithms. The first model, TLR-SC, uses linear regression to predict the complex data surface of the SC. It
transforms a nonlinear relationship into a linear one by transforming the original features. TLR-SC has a simple
structure and is very efficient, although it is aimed at Quicksort and is not easily transferable to other sorting
algorithms. The second model, NLR-SC, takes an iterative approach. It divides the surface-fitting problem into
multiple curve-fitting problems and, by predicting curves one by one, gradually predicts the entire surface. NLR-SC
takes advantage of the theory of smoothed analysis, which leads to its very good performance. With 15 training
examples, NLR-SC can predict the SC of Quicksort for N as large as 3000, with around 4% Mean Absolute Percentage
Error. The rules NLR-SC relies on are general and grounded in the theory of SC: when the list perturbation parameter
K approaches 1, the SC of any sorting algorithm follows the worst-case behaviour, and when K approaches N, the
SC follows the average-case behaviour. Therefore, technically, as long as we know the worst-case and average-case
complexity of a sorting algorithm, and have a small number of training examples, we can use NLR-SC to predict
the SC of the algorithm.

In this work we study the SC of four sorting algorithms and show that our prediction models deliver high
quality results. A large part of our study focuses on data collection and analysis, since, to start with, there is
very limited availability of ground truth data to test our prediction models. By getting a good understanding
of the data shape and the relationship between the features and the SC, we construct two predictive models that
work very well with limited training data, and show that the shortage of data can be made up for by adequate
background knowledge, in our case provided by the theory of SC. Our work fills the gap between known theoretical
and empirical results on the behaviour of the Smoothed Complexity of sorting algorithms. By taking a machine
learning approach, we build predictive models that are scalable for large input sizes, therefore advancing currently
existing techniques. In the future, we plan to further analyze the potential of our prediction models, by studying
other types of algorithms (beyond sorting) that work on discrete data.

Fig. 20: Plot of the predicted SC of 4 sorting algorithms, for N ≤ 60, 2 ≤ K ≤ N: (a) Quicksort, (b) M3 Quicksort,
(c) Bubblesort, (d) Mergesort.

References 

Arthur, D., Manthey, B., and Röglin, H. (2009). k-means has polynomial smoothed complexity. In Foundations
of Computer Science, 2009. FOCS’09. 50th Annual IEEE Symposium on, pages 405-414. IEEE.

Banderier, C., Beier, R., and Mehlhorn, K. (2003). Smoothed analysis of three combinatorial problems. In
Mathematical Foundations of Computer Science 2003, pages 198-207. Springer.

Blum, A. and Dunagan, J. (2002). Smoothed analysis of the perceptron algorithm for linear programming. In
Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms, SODA ’02, pages 905-914,
Philadelphia, PA, USA. Society for Industrial and Applied Mathematics.

Deshpande, A. and Spielman, D. A. (2005). Improved smoothed analysis of the shadow vertex simplex method.
In Foundations of Computer Science, 2005. FOCS 2005. 46th Annual IEEE Symposium on, pages 349-356.
IEEE.

Hastie, T., Tibshirani, R., and Friedman, J. (2013). The Elements of Statistical Learning. Springer Series in
Statistics.

Hennessy, A. and Schellekens, M. (2014). Modular smoothed analysis of median-of-three quicksort. Technical
Report, University College Cork, Ireland. Submitted to Discrete Mathematics.

Knuth, D. E. (1973). The Art of Computer Programming, Volume III: Sorting and Searching. Addison-Wesley.

Motulsky, H. (2004). Fitting models to biological data using linear and nonlinear regression: a practical guide
to curve fitting. OUP USA.

Schellekens, M. (2008). A Modular Calculus for the Average Cost of Data Structuring: Efficiency-Oriented
Programming in MOQA. Springer.

Schellekens, M., Hennessy, A., and Shi, B. (2014). Modular smoothed analysis. Technical Report, University
College Cork, Ireland. Submitted to Discrete Mathematics.

Shi, B. (2013). A Machine Learning Approach For Estimating The Smoothed Complexity Of Sorting Algorithms.
Master’s thesis. University College Cork, Cork, Ireland.

Spielman, D. and Teng, S.-H. (2001). Smoothed analysis of algorithms: Why the simplex algorithm usually takes
polynomial time. In Proceedings of the thirty-third annual ACM symposium on Theory of computing, pages
296-305. ACM.

Spielman, D. A. and Teng, S. (2006). Smoothed analysis of algorithms and heuristics. London Mathematical
Society Lecture Note Series, 331, 274.

Spielman, D. A. and Teng, S.-H. (2002). Smoothed analysis of algorithms. ArXiv Mathematics e-prints.

Spielman, D. A. and Teng, S.-H. (2009). Smoothed analysis: an attempt to explain the behavior of algorithms in
practice. Communications of the ACM, 52(10), 76-84.

Witten, I. H., Frank, E., and Hall, M. A. (2011). Data Mining: Practical Machine Learning Tools and Techniques.
Morgan Kaufmann.



